DataJoint 2.0 #1311

dimitri-yatsenko · 2026-01-07T16:28:35Z

Summary

DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.

Planning: DataJoint 2.0 Plan | Milestone 2.0

Major Features

Codec System (Extensible Types)

Replaces the adapter system with a modern, composable codec architecture:

Base codecs: <blob>, <json>, <attach>, <filepath>, <object>, <hash>
Chaining: Codecs can wrap other codecs (e.g., <blob> wraps <json> for external storage)
Auto-registration: Custom codecs register via __init_subclass__
Validation: Optional validate() method for type checking before insert

from datajoint import Codec

class MyCodec(Codec):
    python_type = MyClass
    dj_type = "<blob>"  # Storage format
    
    def encode(self, value): ...
    def decode(self, value): ...
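
The auto-registration bullet above can be illustrated with a toy registry. This is a minimal sketch of the general `__init_subclass__` pattern, not DataJoint's actual implementation; `ToyCodec`, `REGISTRY`, the `name` class attribute, and `UpperCodec` are hypothetical names introduced only for illustration.

```python
# Toy illustration of subclass auto-registration via __init_subclass__.
# ToyCodec, REGISTRY and the `name` attribute are hypothetical, not DataJoint API.
REGISTRY = {}


class ToyCodec:
    name = None  # subclasses set this to register themselves

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name is not None:
            if cls.name in REGISTRY:
                raise ValueError(f"codec {cls.name!r} is already registered")
            REGISTRY[cls.name] = cls

    def validate(self, value):
        """Optional hook: raise before insert if the value has the wrong type."""

    def encode(self, value):
        raise NotImplementedError

    def decode(self, stored):
        raise NotImplementedError


class UpperCodec(ToyCodec):
    name = "upper"

    def validate(self, value):
        if not isinstance(value, str):
            raise TypeError("UpperCodec expects str")

    def encode(self, value):
        self.validate(value)
        return value.upper().encode("utf-8")

    def decode(self, stored):
        return stored.decode("utf-8")


# Defining the subclass was enough to register it; no explicit call is needed.
codec = REGISTRY["upper"]()
encoded = codec.encode("spike")
```

The appeal of this pattern is that defining the class is the registration step, so a custom codec cannot be declared and then forgotten.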

Semantic Matching

Attribute lineage tracking ensures joins only match semantically compatible attributes:

Attributes track their origin through foreign key inheritance
Joins require matching lineage (not just matching names)
Prevents accidental matches on generic names like id or name
semantic_check=False for legacy permissive behavior

# These join on subject_id because both inherit from Subject
Session * Recording  # ✓ Works - same lineage

# These fail because 'id' has different origins
TableA * TableB  # ✗ Fails - different lineage for 'id'
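
The lineage rule can be modeled in a few lines. The sketch below is a toy model, not DataJoint's implementation: headings are plain dicts mapping an attribute name to a lineage string, and `join_attributes` plus the `lab.*` lineage strings are hypothetical.

```python
# Toy model of lineage-aware join matching; not DataJoint's implementation.
# A heading maps attribute name -> lineage (where the attribute originated).
def join_attributes(left, right, semantic_check=True):
    """Return the attribute names two headings would be joined on."""
    shared = [name for name in left if name in right]
    if semantic_check:
        for name in shared:
            if left[name] != right[name]:
                raise ValueError(
                    f"attribute {name!r} has different lineage: "
                    f"{left[name]} vs {right[name]}"
                )
    return shared


session = {"subject_id": "lab.subject.subject_id", "session_idx": "lab.session.session_idx"}
recording = {"subject_id": "lab.subject.subject_id", "rec_idx": "lab.recording.rec_idx"}
table_a = {"id": "lab.table_a.id"}
table_b = {"id": "lab.table_b.id"}

# Same name and same origin: the join proceeds on subject_id.
matched = join_attributes(session, recording)

# Same name but different origins: the join is rejected.
try:
    join_attributes(table_a, table_b)
    rejected = False
except ValueError:
    rejected = True

# Legacy permissive behavior matches on name alone.
permissive = join_attributes(table_a, table_b, semantic_check=False)
```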

Primary Key Rules

Rigorous primary key propagation through all operators:

Join: Result PK based on functional dependencies (A→B, B→A, both, neither)
Aggregation: Groups by left operand's primary key
Projection: Preserves PK attributes, drops secondary
Universal set: dj.U('attr') creates ad-hoc grouping entities
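
One plausible reading of the four join cases is sketched below. It assumes that "A→B" means every primary-key attribute of B appears somewhere in A, and that the "both" case keeps the left operand's key; both are assumptions, and the function names and example tables are hypothetical.

```python
# Sketch of join primary-key selection under the assumptions stated above.
def determines(a_attrs, b_pk):
    """A -> B: every primary-key attribute of B is present in A."""
    return set(b_pk) <= set(a_attrs)


def join_primary_key(a_pk, a_attrs, b_pk, b_attrs):
    """Pick the primary key of A * B from the four dependency cases."""
    if determines(a_attrs, b_pk):  # A->B, including the "both" case
        return list(a_pk)
    if determines(b_attrs, a_pk):  # only B->A
        return list(b_pk)
    # Neither direction holds: the key is the ordered union of both keys.
    return list(a_pk) + [name for name in b_pk if name not in a_pk]


session_pk = ["subject_id", "session_idx"]
session_attrs = session_pk + ["session_date"]
subject_pk = ["subject_id"]
subject_attrs = subject_pk + ["species"]
stimulus_pk = ["stim_id"]
stimulus_attrs = stimulus_pk + ["contrast"]

pk_session_subject = join_primary_key(session_pk, session_attrs, subject_pk, subject_attrs)
pk_subject_session = join_primary_key(subject_pk, subject_attrs, session_pk, session_attrs)
pk_subject_stimulus = join_primary_key(subject_pk, subject_attrs, stimulus_pk, stimulus_attrs)
```

Under this reading, joining a session with its subject keeps the session key regardless of operand order, while two unrelated tables produce a composite key.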

AutoPopulate 2.0 (Jobs System)

Per-table job management with enhanced tracking:

Hidden metadata: ~~_job_timestamp and ~~_job_duration columns
Per-table jobs: Each computed table has its own ~~table_name job table
Schema.jobs: List all job tables in a schema
Progress tracking: table.progress() returns (remaining, total)
Priority scheduling: Jobs ordered by priority, then timestamp
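
Ordering by priority and then timestamp amounts to sorting on a two-part key. The sketch below uses plain dicts as hypothetical job records and assumes that a lower priority number means more urgent; the description above does not state the direction.

```python
from datetime import datetime

# Hypothetical job records; the field names are illustrative only.
jobs = [
    {"key": {"session": 1}, "priority": 5, "timestamp": datetime(2026, 1, 7, 12, 0)},
    {"key": {"session": 2}, "priority": 1, "timestamp": datetime(2026, 1, 7, 12, 5)},
    {"key": {"session": 3}, "priority": 5, "timestamp": datetime(2026, 1, 7, 11, 0)},
]

# Assumption: lower priority value runs first; ties go to the older job.
ordered = sorted(jobs, key=lambda job: (job["priority"], job["timestamp"]))
order = [job["key"]["session"] for job in ordered]
```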

Modern Fetch & Insert API

New fetch methods:

to_dicts() - List of dictionaries
to_pandas() - DataFrame with PK as index
to_arrays(*attrs) - NumPy arrays (structured or individual)
keys() - Primary keys only
fetch1() - Single row

Insert improvements:

validate() - Check rows before inserting
chunk_size - Batch large inserts
insert_dataframe() - DataFrame with index handling
Empty inserts for tables with all-default attributes (IMPR: Error specificity on empty insert for table with default values #1280)
Polars and PyArrow support
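
Batching by `chunk_size` can be pictured as splitting the incoming rows into fixed-size groups before sending each group to the database. The helper below is a generic sketch of that idea, not DataJoint's code; `chunked` is a hypothetical name.

```python
from itertools import islice


def chunked(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield batch


rows = [{"trial": i} for i in range(7)]
batch_sizes = [len(batch) for batch in chunked(rows, 3)]
```

Because the helper consumes an iterator lazily, only one batch is held in memory at a time, which is the point of chunking a large insert.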

Type Aliases

Core DataJoint types for portability:

Alias	MySQL Type
`int8`, `int16`, `int32`, `int64`	tinyint, smallint, int, bigint
`uint8`, `uint16`, `uint32`, `uint64`	tinyint unsigned, smallint unsigned, int unsigned, bigint unsigned
`float32`, `float64`	float, double
`bool`	tinyint
`uuid`	binary(16)
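
The alias table can be written out as a lookup. The mapping values below come from the table and commit messages in this PR; the `resolve_type` helper and its pass-through behavior for non-alias types are assumptions made for illustration.

```python
# Alias -> MySQL type, as listed in this PR.
ALIAS_TO_MYSQL = {
    "int8": "tinyint",
    "int16": "smallint",
    "int32": "int",
    "int64": "bigint",
    "uint8": "tinyint unsigned",
    "uint16": "smallint unsigned",
    "uint32": "int unsigned",
    "uint64": "bigint unsigned",
    "float32": "float",
    "float64": "double",
    "bool": "tinyint",
    "uuid": "binary(16)",
}


def resolve_type(declared):
    """Translate an alias to its MySQL type; leave other types unchanged (assumed)."""
    declared = declared.strip()
    return ALIAS_TO_MYSQL.get(declared, declared)


resolved = [resolve_type(t) for t in ("uint16", "float64", "varchar(32)")]
```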

Object Storage

Content-addressed and object storage types:

<hash> - Content-addressed storage with deduplication
<object> - Named object storage (Zarr, folders)
<filepath> - Reference to managed files
<attach> - File attachments (uploaded on insert)
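
Content-addressed storage with deduplication can be sketched with an in-memory store keyed by a digest of the content. This is a toy model only: `ToyHashStore` is hypothetical, and the choice of SHA-256 is an assumption, since the hash function is not named above.

```python
import hashlib


class ToyHashStore:
    """In-memory content-addressed store: the key is derived from the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()  # assumed hash function
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self):
        return len(self._blobs)


store = ToyHashStore()
first = store.put(b"waveform-bytes")
second = store.put(b"waveform-bytes")  # duplicate content
third = store.put(b"other-bytes")
```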

Virtual Schema Infrastructure (#1307)

New schema introspection API for exploring existing databases:

Schema.get_table(name) - Direct table access with auto tier prefix detection
Schema['TableName'] - Bracket notation access
for table in schema - Iterate tables in dependency order
'TableName' in schema - Check table existence
dj.virtual_schema() - Clean entry point for accessing schemas
dj.VirtualModule() - Virtual modules with custom names
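
The access patterns listed above map onto Python's container protocol. The toy class below is a sketch and not the DataJoint `Schema` class: it stores table names as strings, assumes DataJoint's conventional tier prefixes (none for manual, `#` for lookup, `_` for imported, `__` for computed), and uses the standard-library `graphlib` for dependency ordering. `ToySchema` and the example table names are hypothetical.

```python
from graphlib import TopologicalSorter

# Assumed tier prefixes, tried in order: manual, lookup, imported, computed.
TIER_PREFIXES = ("", "#", "_", "__")


class ToySchema:
    def __init__(self, dependencies):
        # dependencies: stored table name -> set of stored names it references
        self._deps = dependencies

    def get_table(self, name):
        """Find a table by its unprefixed name, detecting the tier prefix."""
        for prefix in TIER_PREFIXES:
            if prefix + name in self._deps:
                return prefix + name
        raise KeyError(name)

    def __getitem__(self, name):
        return self.get_table(name)

    def __contains__(self, name):
        try:
            self.get_table(name)
        except KeyError:
            return False
        return True

    def __iter__(self):
        # Parents are yielded before the tables that depend on them.
        return iter(TopologicalSorter(self._deps).static_order())


schema = ToySchema(
    {
        "#species": set(),
        "subject": {"#species"},
        "session": {"subject"},
        "__spike_sorting": {"session"},
    }
)
ordered_tables = list(schema)
```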

CLI Improvements

The dj command-line interface for interactive exploration:

dj -s schema:alias - Load schemas as virtual modules
--host, --user, --password - Connection options
Fixed -h conflict with --help

Settings Modernization

Pydantic-based configuration with validation:

Type-safe settings with automatic validation
dj.config.override() context manager
Secrets directory support (.secrets/)
Environment variable overrides (DJ_HOST, etc.)
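
The override context manager and environment-variable overrides can be sketched without Pydantic. The class below is a toy stand-in, not `dj.config`: `ToyConfig` is hypothetical, and the `DJ_` + upper-cased key naming is an assumption generalized from the `DJ_HOST` example above.

```python
import os
from contextlib import contextmanager


class ToyConfig:
    def __init__(self, environ=None, **defaults):
        environ = os.environ if environ is None else environ
        self._values = dict(defaults)
        # Assumed naming: setting "host" is overridden by DJ_HOST, and so on.
        for key in defaults:
            env_name = "DJ_" + key.upper()
            if env_name in environ:
                self._values[key] = environ[env_name]

    def __getitem__(self, key):
        return self._values[key]

    @contextmanager
    def override(self, **changes):
        """Temporarily change settings; restore them even if the body raises."""
        unknown = set(changes) - set(self._values)
        if unknown:
            raise KeyError(f"unknown settings: {sorted(unknown)}")
        saved = dict(self._values)
        self._values.update(changes)
        try:
            yield self
        finally:
            self._values = saved


config = ToyConfig(environ={"DJ_HOST": "db.example.org"}, host="localhost", port=3306)
with config.override(port=3307):
    port_inside = config["port"]
port_after = config["port"]
```

Restoring in a `finally` clause is what makes the override safe to use around code that may fail, such as a test body.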

License Change

Changed from LGPL to Apache 2.0 license (#1235 (discussion)):

More permissive for commercial and academic use
Compatible with broader ecosystem of tools
Clearer patent grant provisions

Breaking Changes

Removed Support

Python 3.8, 3.9 (minimum 3.10)
MySQL 5.x (minimum 8.0)
Legacy fetch() with format parameter
safemode parameter (use prompt)
Adapter API (use Codec)
create_virtual_module (use dj.virtual_schema() or dj.VirtualModule())
~log table (IMPR: Deprecate and Remove the ~log Table. #1298)
otumat support (IMPR: Deprecate otumat support #1252)

API Changes

fetch() → to_dicts(), to_pandas(), to_arrays()
fetch(format='frame') → to_pandas()
fetch(as_dict=True) → to_dicts()
safemode=False → prompt=False

Semantic Changes

Joins now require lineage compatibility by default
Aggregation keeps non-matching rows by default (like LEFT JOIN)

Documentation

Developer Documentation (this repo)

Comprehensive updates in docs/:

NumPy-style docstrings for all public APIs
Architecture guides for contributors
Auto-generated API reference via mkdocstrings

User Documentation (datajoint-docs)

Full documentation site following the Diátaxis framework:

Tutorials (learning-oriented, Jupyter notebooks):

Getting Started - Installation, connection, first schema
Schema Design - Table tiers, definitions, foreign keys
Data Entry - Insert patterns, lookups, manual tables
Queries - Restriction, projection, join, aggregation, fetch
Computation - Computed tables, make(), populate patterns
Object Storage - Blobs, attachments, external storage

How-To Guides (task-oriented):

Configure object storage, Design primary keys, Model relationships
Handle computation errors, Manage large datasets, Create custom codecs
Use the CLI, Migrate from 1.x

Reference (specifications):

Table Declaration, Query Algebra, Data Manipulation
Primary Keys, Semantic Matching, Type System, Virtual Schemas
Codec API, AutoPopulate, Fetch API, Job Metadata

Project Structure

src/ layout for proper packaging (IMPR: src layout #1267)
Testcontainers for pytest-managed containers
Pre-commit hooks: ruff, mypy, unit tests (IMPR: Modernize pre-commit #1271)
GitHub Actions CI/CD
Split unit/integration tests (IMPR: split unit/integration test #1211)

Test Plan

580+ integration tests pass
80+ unit tests pass
Pre-commit hooks pass
Documentation builds successfully
Tutorials execute against test database

Closes

Related PRs

datajoint-docs PR #97 - DataJoint 2.0 Documentation
datajoint-docs PR #98 - Virtual schemas spec and CLI docs

Migration Guide

See How to Migrate from 1.x for detailed migration instructions.

🤖 Generated with Claude Code


- Remove ConfigWrapper class and backward compatibility layer
- Use direct pydantic BaseSettings with typed nested models
- Change context manager from config(...) to config.override(...)
- Add validate_assignment=True for runtime type checking
- Improve store spec validation with clear error messages
- Update all tests to use new API
- Preserve dict-style access via __getitem__/__setitem__ for convenience

Config file search:
- Search for datajoint.json recursively up from cwd
- Stop at .git/.hg boundaries or filesystem root
- Warn if no config file found (instead of silently using defaults)

Secrets management:
- Add .secrets/ directory support (next to datajoint.json)
- Support /run/secrets/datajoint/ for Docker/Kubernetes
- Use SecretStr for password and aws_secret_access_key
- Secrets masked in repr/logs, excluded from save()
- Dict access automatically unwraps SecretStr for compatibility

Breaking changes:
- Config file renamed from dj_local_conf.json to datajoint.json
- No more ~/.datajoint_config.json (project-only config)
- Secrets should be in env vars or .secrets/ directory
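
The config file search described in this commit message can be sketched with `pathlib`. This is an illustration of the stated rules (walk up from the start directory, stop at a `.git`/`.hg` boundary or the filesystem root), not the code in the PR; `find_config` is a hypothetical name, and it returns `None` where the real implementation is described as warning.

```python
import tempfile
from pathlib import Path


def find_config(start, filename="datajoint.json"):
    """Walk up from start; return the config path or None if none is found."""
    current = Path(start).resolve()
    while True:
        candidate = current / filename
        if candidate.is_file():
            return candidate
        # A repository root is a project boundary: do not search above it.
        if (current / ".git").exists() or (current / ".hg").exists():
            return None
        if current.parent == current:  # filesystem root reached
            return None
        current = current.parent


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()

    # Case 1: the config sits at the repository root, above the start directory.
    project = base / "project"
    nested = project / "src" / "pkg"
    nested.mkdir(parents=True)
    (project / ".git").mkdir()
    (project / "datajoint.json").write_text("{}")
    found = find_config(nested)
    found_at_project_root = found is not None and found.parent == project

    # Case 2: a config exists only above the repository boundary, so it is ignored.
    outer = base / "outer"
    inner_repo = outer / "repo"
    inner_repo.mkdir(parents=True)
    (inner_repo / ".git").mkdir()
    (outer / "datajoint.json").write_text("{}")
    blocked = find_config(inner_repo) is None
```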

Add convenient type aliases that map to MySQL types:
- float32 -> float
- float64 -> double
- int32 -> int
- uint32 -> int unsigned
- int16 -> smallint
- uint16 -> smallint unsigned
- int8 -> tinyint
- uint8 -> tinyint unsigned

These aliases follow the same pattern as UUID, storing the original type in the column comment for round-tripping.

- int64 -> bigint
- uint64 -> bigint unsigned

Add comprehensive tests for the new type aliases feature:
- Pattern matching tests for all 10 type aliases
- MySQL type mapping verification
- Table creation with type aliases
- Insert and fetch operations
- Primary key usage with type aliases
- Nullable column support


Documentation is now consolidated in datajoint-docs repository.

Changes:
- Delete docs/ folder (legacy MkDocs infrastructure)
- Create ARCHITECTURE.md with transpiler design docs
- Update README.md links to point to docs.datajoint.com

The Developer Guide remains in README.md. Internal architecture documentation for contributors is now in ARCHITECTURE.md.

Co-Authored-By: Claude Opus 4.5 <[email protected]>


- Rename CHANGELOG.md to CHANGELOG-archive.md with redirect to GitHub Releases
- Add "Writing Release Notes" section to RELEASE_MEMO.md:
  - Categories (BREAKING, Added, Changed, Deprecated, Fixed, Security)
  - Format template with examples
  - Guidelines for good release notes
  - PR label mapping for release drafter

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Slim README.md to essentials (intro, badges, install, links)
- Create CONTRIBUTING.md with:
  - Development setup (pixi and pip)
  - Test running instructions
  - Pre-commit hooks
  - Environment variables
  - Condensed docstring style guide
- Delete DOCSTRING_STYLE.md (merged into CONTRIBUTING.md)

README: 218 → 82 lines
All detailed docs now at docs.datajoint.com

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add .github/DISCUSSION_TEMPLATE/rfc.yml for enhancement proposals
- Fix table header alignment (center instead of right)
- Fix excessive padding in table headers by removing p tag margins

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Raw blobs (no codec) now show "bytes"
- Raw json (no codec) shows "json"
- Codec fields show "<codec_name>" (e.g., <blob>, <hash>, <object>)
- HTML output properly escapes angle brackets for browser display
- Improves clarity when viewing table contents

Example output:
*id  raw_blob  blob_data  json_data
1    bytes     <blob>     json

Co-Authored-By: Claude Opus 4.5 <[email protected]>
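
The display rules in this commit message reduce to a small decision function. The sketch below follows the listed rules and uses the standard-library `html.escape` for the browser case; `preview_placeholder` and its parameters are hypothetical names, not the function in the PR.

```python
import html


def preview_placeholder(sql_type, codec=None, as_html=False):
    """Return the text shown in a table preview for a blob-like field."""
    if codec is not None:
        text = f"<{codec}>"  # codec fields show their codec name
    elif sql_type == "json":
        text = "json"  # raw json, no codec
    else:
        text = "bytes"  # raw blob, no codec
    # Angle brackets must be escaped or a browser treats them as a tag.
    return html.escape(text) if as_html else text


plain = preview_placeholder("longblob", codec="blob")
escaped = preview_placeholder("longblob", codec="blob", as_html=True)
raw_blob = preview_placeholder("longblob")
raw_json = preview_placeholder("json")
```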

Add migrate_external() and migrate_filepath() to datajoint.migrate module for safe migration of 0.x external storage columns to 2.0 JSON format.

Migration strategy:
1. Add new <column>_v2 columns with JSON type
2. Copy and convert data from old columns
3. User verifies data accessible via DataJoint 2.0
4. Finalize: rename columns (old → _v1, new → original)

This allows 0.x and 2.0 to coexist during migration and provides rollback capability if issues are discovered.

Functions:
- migrate_external(schema, dry_run=True, finalize=False)
- migrate_filepath(schema, dry_run=True, finalize=False)
- _find_external_columns(schema) - detect 0.x external columns
- _find_filepath_columns(schema) - detect 0.x filepath columns

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Implement the `<npy@>` codec for schema-addressed numpy array storage:
- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17

Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema, _table, _field, and primary key values to construct schema-addressed storage paths. Previously, key=None was passed, resulting in "unknown/unknown" paths. Now builds proper context dict from table metadata and row values, enabling navigable paths like: {schema}/{table}/objects/{pk_path}/{attribute}.npy Co-Authored-By: Claude Opus 4.5 <[email protected]>
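
The path template in this commit message can be made concrete with a small formatter. The template itself is taken from the message; how the primary key is rendered into `{pk_path}` is not stated, so the `name=value` segments below are an assumption, and `schema_addressed_path` is a hypothetical helper.

```python
def schema_addressed_path(schema, table, key, attribute, suffix=".npy"):
    """Build {schema}/{table}/objects/{pk_path}/{attribute}.npy for one row."""
    # Assumed encoding: one "name=value" path segment per primary-key attribute.
    pk_path = "/".join(f"{name}={value}" for name, value in key.items())
    return f"{schema}/{table}/objects/{pk_path}/{attribute}{suffix}"


path = schema_addressed_path(
    "lab", "recording", {"subject_id": 1, "rec_idx": 2}, "waveform"
)
```

Whatever the exact encoding, deriving the path from schema, table, and key is what makes the stored files navigable without consulting the database.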


Address reviewer feedback from PR #1330: attr should never be None since field_name comes from heading.names. Raising an error surfaces bugs immediately rather than silently returning a misleading placeholder. Co-Authored-By: Claude Opus 4.5 <[email protected]>


…orage

Major changes to hash-addressed storage model:
- Rename content_registry.py → hash_registry.py for clarity
- Always store full path in metadata (protects against config changes)
- Use stored path directly for retrieval (no path regeneration)
- Add delete_path() as primary function, deprecate delete_hash()
- Add get_size() as primary function, deprecate get_hash_size()
- Update gc.py to work with paths instead of hashes
- Update builtin_codecs.py HashCodec to use new API

This design enables seamless migration from v0.14:
- Legacy data keeps old paths in metadata
- New data uses new path structure
- GC compares stored paths against filesystem

Co-Authored-By: Claude Opus 4.5 <[email protected]>


Remove dead code that was only tested but never used in production:
- hash_exists (gc uses set operations on paths)
- delete_hash (gc uses delete_path directly)
- get_size (gc collects sizes during walk)
- get_hash_size (wrapper for get_size)

Remaining API: compute_hash, build_hash_path, get_store_backend, get_store_subfolding, put_hash, get_hash, delete_path

Co-Authored-By: Claude Opus 4.5 <[email protected]>


MilagrosMarin · 2026-01-13T23:16:06Z

src/datajoint/__init__.py

+    "errors",
+    "migrate",
+    "DataJointError",
+    "key",


@dimitri-yatsenko The datajoint-docs state that dj.key is removed in 2.0 (see https://github.com/datajoint/datajoint-docs/blob/pre/v2.0/src/reference/specs/fetch-api.md#removed-methods-and-parameters), but 'key' is still listed in __all__ here. Should key be removed from the exports, or does the documentation need to be updated?

d-v-b and others added 30 commits September 16, 2025 20:55

lint with ruff

1aa30f4

update linting workflow

de4ce27

update test workflow to use src layout

59d0159

update test workflow to use src layout

85ff041

update hook invocations to use src layout

2007f33

Merge pull request #1274 from d-v-b/fix/unbreak-test-workflow

9fcb25a

update test workflow to use src layout

Merge pull request #1269 from d-v-b/feat/pytest-container-management

896e6cd

use pytest to manage docker container startup for tests

simplify devcontainer

b3b712b

update deps, and add activate script for dot

a506d40

refactor test fixtures

88ca4dc

skip multiprocessing tests on osx

a30d41b

skip c901 check

d631b8b

update pre-commit

f45e7c8

Merge pull request #1279 from d-v-b/chore/dev-env-fixes

4893e3d

Chore/dev env fixes

Merge branch 'pre/v2.0' of https://github.com/datajoint/datajoint-python

66c9ebd

into impr/modernize-pre-commit

more linting

b00a4f0

sync pyproject.toml

65b701a

fix long matlab blobs

908d226

Merge pull request #1273 from d-v-b/impr/modernize-pre-commit

cb4c128

Impr/modernize pre commit

switch settings to use pydantic-settings

ddae20d

remove unnecessary class

80d6d6f

Merge pull request #1281 from d-v-b/chore/use-pydantic-settings

defebec

switch settings to use pydantic-settings

Merge remote-tracking branch 'origin/pre/v2.0' into claude/review-settings-management-adsuG

a268bd3

fix: remove unused imports (ruff)

df45286

style: apply ruff-format changes

898c5c2

feat: add int64 and uint64 type aliases

864121d

- int64 -> bigint
- uint64 -> bigint unsigned

dimitri-yatsenko added the bug and breaking labels on Jan 9, 2026

style: Reduce query result table font size to 75%

23967f4

Makes tables more compact in notebook displays. Co-Authored-By: Claude Opus 4.5 <[email protected]>

github-actions bot removed the bug and breaking labels on Jan 10, 2026

dimitri-yatsenko and others added 24 commits January 9, 2026 18:28

chore: Remove accidentally committed config files

90e5c17

Co-Authored-By: Claude Opus 4.5 <[email protected]>

docs: Use uint16 instead of native int in codec examples

5db3359

Co-Authored-By: Claude Opus 4.5 <[email protected]>

chore: Remove accidentally committed config files

7269d44

Co-Authored-By: Claude Opus 4.5 <[email protected]>

chore: Add .secrets and datajoint.json to gitignore

8456f39

Co-Authored-By: Claude Opus 4.5 <[email protected]>

refactor: Remove ARCHITECTURE.md, content moved to docs spec

5d2ba05

Transpilation documentation moved to datajoint-docs query-algebra spec. Developer docs now consolidated in README.md. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Merge remote-tracking branch 'origin/enhance/blob-preview-display' into feature/npy-codec

14d3da6

chore: Merge enhance/blob-preview-display and bump to 2.0.0a18

9f6826e

Merge PR #1330 (blob preview display) into feature/npy-codec. Bump version from 2.0.0a17 to 2.0.0a18. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Merge pull request #1330 from datajoint/enhance/blob-preview-display

4f6625b

Show codec names in table preview instead of =BLOB=

Merge remote-tracking branch 'origin/pre/v2.0' into feature/npy-codec

6b951d4

feat: Add mmap_mode parameter to NpyRef.load()

12ea814

Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap

Co-Authored-By: Claude Opus 4.5 <[email protected]>

fix: Remove unused variable in mmap test

c02a882

refactor: Remove uuid_from_buffer, use hashlib directly for query cache

d2ab4de

- Remove uuid_from_buffer from hash.py (dead code)
- connection.py now uses hashlib.md5().hexdigest() directly
- Update test_hash.py to test key_hash instead

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Merge pull request #1331 from datajoint/feature/npy-codec

471b8a9

feat: Add NpyCodec for lazy-loading numpy arrays

MilagrosMarin reviewed Jan 13, 2026
